[TLDR] Hardware-Efficient Attention for Fast Decoding

Optimizing Latency and Throughput in LLM Decoding with GTA & GLA

Published

May 27, 2025

Authors: T. Zadouri et al.
Published on arXiv: 2025-05-27
Link: http://arxiv.org/abs/2505.21487v1
Institutions: Department of Computer Science, Princeton University • Princeton Language and Intelligence, Princeton University
Keywords: hardware-efficient attention, KV cache, arithmetic intensity, Grouped-Tied Attention (GTA), Grouped Latent Attention (GLA), Multi-head Latent Attention (MLA), Grouped-Query Attention (GQA), tensor parallelism, paged KV, FlashAttention3, speculative decoding, latency, throughput, LLM inference, FineWeb-Edu-100B, NVIDIA H100, memory-bound workload

Large language model (LLM) decoding faces performance bottlenecks due to high memory bandwidth requirements for KV cache retrieval, especially as context length and batch size grow. Existing attention mechanisms, like Multi-Head Attention (MHA), are increasingly suboptimal as compute capabilities outpace memory bandwidth improvements, leaving modern inference workloads memory-bound and limiting acceleration potential on new hardware.
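Why decoding is memory-bound can be seen with a back-of-envelope estimate of arithmetic intensity (FLOPs performed per byte of KV cache read) for a single decode step. The function below is an illustrative sketch with assumed model sizes, not a measurement of any real kernel:

```python
def attn_decode_intensity(n_heads, head_dim, seq_len, kv_heads=None, bytes_per_elt=2):
    """Rough arithmetic intensity (FLOPs per byte) of one attention decode step.

    Counts only the dominant terms: the two matmuls per query head
    (q @ K^T and probs @ V) and the cost of reading the K/V caches once.
    Back-of-envelope math for illustration, not a kernel measurement.
    """
    kv_heads = kv_heads if kv_heads is not None else n_heads
    # 2*seq_len*head_dim FLOPs for q @ K^T plus the same for probs @ V, per query head
    flops = 4 * n_heads * seq_len * head_dim
    # Dominant memory traffic: reading the K cache and V cache once (fp16/bf16)
    kv_bytes = 2 * seq_len * kv_heads * head_dim * bytes_per_elt
    return flops / kv_bytes
```

With full MHA (every head caching its own K/V) this comes out to about 1 FLOP per byte, orders of magnitude below an H100's compute-to-bandwidth ratio, so the step is memory-bound; sharing KV heads across groups of query heads (as in GQA) raises the intensity proportionally.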

To address these hardware challenges, the authors propose and evaluate two novel attention mechanisms designed to maximize arithmetic intensity and parallelizability without quality loss:

- Grouped-Tied Attention (GTA): ties the key and value states into a single shared representation per group of query heads, roughly halving the KV cache and its memory traffic relative to GQA while matching GQA's quality.
- Grouped Latent Attention (GLA): a parallel-friendly variant of latent attention that shards latent heads across devices under tensor parallelism, paired with low-level kernel optimizations for fast decoding while preserving MLA-level quality.
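To make the KV-cache savings concrete, the helper below compares approximate per-token cache footprints across MHA, GQA, GTA, and MLA. It is a hypothetical sketch (the function name and default sizes are ours) that ignores RoPE bookkeeping and other per-variant details the paper accounts for:

```python
def kv_bytes_per_token(variant, n_heads=32, head_dim=128, kv_groups=4,
                       latent_dim=512, bytes_per_elt=2):
    """Approximate bytes cached per token for each attention variant.

    A hypothetical sketch: defaults are illustrative, and RoPE dimensions
    and other small per-variant bookkeeping are ignored.
    """
    if variant == "MHA":   # every query head caches its own K and V
        return 2 * n_heads * head_dim * bytes_per_elt
    if variant == "GQA":   # one K and one V shared per group of query heads
        return 2 * kv_groups * head_dim * bytes_per_elt
    if variant == "GTA":   # K and V tied into one shared state per group
        return kv_groups * head_dim * bytes_per_elt
    if variant == "MLA":   # a single low-rank latent vector per token
        return latent_dim * bytes_per_elt
    raise ValueError(f"unknown variant: {variant}")
```

With these defaults, GTA caches half of GQA's bytes per token, mirroring the headline claim that GTA matches GQA quality with roughly half the KV cache.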

Following these innovations, the authors run quantitative and qualitative benchmarks to demonstrate the advantages:

- Quality: models trained at several scales on FineWeb-Edu-100B show GTA matching GQA, and GLA matching MLA, in validation perplexity and downstream evaluations.
- Efficiency: optimized decoding kernels are benchmarked on NVIDIA H100 GPUs, where GLA's decode kernel reaches up to 2x the speed of DeepSeek's FlashMLA, for example in speculative decoding settings with longer query lengths, and GLA sustains throughput in skewed, long-context serving workloads.
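One benchmarked effect can be sketched numerically: under tensor parallelism, MLA's single latent head must be replicated on every device, whereas GLA shards its latent heads, shrinking each device's cache. The function below is an illustrative sketch with assumed sizes, not the paper's benchmark harness:

```python
def per_device_latent_cache_bytes(seq_len, latent_dim=512, tp_degree=2,
                                  sharded=True, bytes_per_elt=2):
    """Per-device latent KV-cache bytes for one sequence under tensor parallelism.

    sharded=False models MLA, whose single latent head is replicated on
    every device; sharded=True models GLA, whose latent heads are split
    across devices. Sizes here are assumptions for illustration.
    """
    per_token = latent_dim * bytes_per_elt
    if sharded:
        per_token //= tp_degree  # each device holds only its slice of the latent
    return seq_len * per_token

# For a 4096-token context on 2 GPUs, sharding halves each device's
# latent cache relative to replicating it.
```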

Synthesizing these results, the study draws its main conclusions: GTA is an effective drop-in replacement for GQA, delivering comparable quality with roughly half the KV cache, while GLA is a parallel-friendly alternative to MLA that preserves quality, shards cleanly under tensor parallelism, and decodes up to 2x faster than FlashMLA. Together they make attention decoding substantially less memory-bound, improving both latency and throughput for LLM inference on modern hardware.